Introduction

The objective of this project is to find a Free Agent player that meets all of the following criteria:

The steps that I take to find the best fit with the given objective can be in the following subsections.


Projecting Salary Model

First, I need to train an accurate model that projects free agent salaries of players in the NBA. The salary cap has increased overtime as shown in the plot below:


To force each historical contract signing to be on the same playing field, I will instead use contract salary cap % as the output variable instead of actual dollar worth of the contract. This makes the underlying function for the models to be:


\[\text{Salary Cap % ~ Basic/Advanced Player Stats + Free Agent Type (UFA/RFA) + Other Player Variables (Age)} \]


Upon mapping out the contract model and the response variable that I am trying to predict, I then set out to gather relevant player values that will be used in the model to project salary. The variables that I set out to collect are outlined in the following subsections.



Individual Player Statistics

The individual player statistics that I collected for the model were both:

  • Traditional Box Score Statistics (PPG, MPG, APG, etc.)
  • Advanced Statistics (VORP, PER, OWS/DWS, etc.)

I used the statistics for players in their last year of the contract to predict their new contract Cap %. The code to compile this data can be seen by clicking the code button below.

###################
##### PURPOSE #####
###################

# The Purpose of this chunk is to gather the individual player metrics introduced in the Rmd above. 
# I will use the ballr package in this process.

# Load in ballr package
library(ballr)

##################
## GATHER DATA ###
##################

## 1) Player Individual Metrics ##

# Data Gather Function - Individual Player Stats
player_stat_fetch <- function(x, type){
  
  #Constructing Output Data frame
  output_df <- if(type == "basic"){
    
  # Running on season selected - BASIC
  NBAPerGameStatistics(x) %>% 
      select(player, pos, tm, g, gs, mp, fga, fgpercent, 
             x3pa, x3ppercent, x2pa, x2ppercent, efgpercent, 
             fta, ftpercent, orb:pts)
    
  }else { 
    
  # Running on season selected - ADVANCED
  NBAPerGameAdvStatistics(x) %>% 
      select(player, pos, tm, per, tspercent, x3par, ftr, orbpercent:usgpercent, 
             ows, dws, ws, obpm, dbpm, bpm, vorp)
  
  }
  
  # Adding season 
  output_df$season <- x

  return(output_df)
  
}

## Loop through function ##

# BASIC
basic_player_stats <- lapply(c(2010:2020), player_stat_fetch, type = "basic")
basic_player_stats <- do.call(rbind, basic_player_stats)

# ADVANCED
advanced_player_stats <- lapply(c(2010:2020), player_stat_fetch, type = "advanced")
advanced_player_stats <- do.call(rbind, advanced_player_stats)

# View of output data frame and its variables 
glimpse(basic_player_stats)
glimpse(advanced_player_stats)

# Joining both and getting a clean final version #
player_stats_combined <- basic_player_stats %>% 
  inner_join(advanced_player_stats, by = c("player", "pos", "tm", "season"))

## FINAL EDITS TO DF ##
player_stats_combined <- player_stats_combined %>% 
  
  # Removing Basketball Reference * in names
  mutate(player = str_replace(player, '\\*', '')) %>% 
  
  # Filtering to get only TOT seasonal stats for traded players
  group_by(player, season) %>% 
  arrange(desc(g)) %>% 
  dplyr::slice(1)

####### END ########



Contract/Contract Type

Upon collecting individual player statistics, I then set out to collect data for the response variable being predicted - Cap % of a Free Agent’s new contract. Cap % was calculated by the following:


\[\text{Salary Cap %} = \frac{\text{Contract Yearly Value}}{\text{Season Total NBA Salary Cap}} \]


In addition to collecting yearly contract data, I also set out to collect data regarding the type of contract for each Free Agent. I hypothesize that Restricted Free Agents tend to generally have higher value contracts to sign them away from the team that can match any contract of the said free agent.

# Load needed packages
library(readxl)

### Read in Contracts data - gathered using spotrac.com ###
contracts <- read_excel("../SunsProject/Contracts.xlsx")

### Getting Yearly Cap - realgm.com ###
cap_data <- data.frame(
  Season = seq(2011, 2019, by = 1), 
  Cap = c(58044000, 58044000, 58679000, 63065000, 70000000, 94143000, 99093000, 101869000, 109140000))

### Caclulate cap percentage yearly ###
contracts <- contracts %>% 
  left_join(cap_data, by = c("Contract_start" = "Season")) %>% 
  mutate(cap_perc = Avg_yearly/Cap)



Final Modeling Data Frame

Finally, I join the two tables that were created above in to one condense dataframe that has individual statistics for players in their contract years as well as the contract value that the Free Agent signed. The code below produces the rds object named - model_data.rds that will be used later in the modeling process.

NOTE: I typically would us primary identifiers such as a playerid for my joins. Here I was forced to join by player name since there were no playerids in the collected data

## PROCESS TO GET NAMES TO BE THE SAME FOR EACH TABLE ##
# Would typically use nbapersonid or other primary identifiers for table joins.

# Getting mismatch names
mismatch_names <- contracts %>% 
  mutate(Player = iconv(Player,from="UTF-8",to="ASCII//TRANSLIT"), 
         Player = trimws(Player)) %>%
  
  left_join(
    
    player_stats_combined %>% ungroup() %>%
      
      # 1) Adding one to season to join to contract year
      mutate(player = iconv(player,from="UTF-8",to="ASCII//TRANSLIT"), 
             player = str_replace(player, "'", ""), 
             player = str_replace(player, '"', ""), 
             player = trimws(player)), 
    
    by = c("Player"="player", "Contract_start" = "season")
    
    ) %>% 
  
  filter(is.na(ws) == TRUE) %>%
  arrange(desc(Avg_yearly)) %>%
  distinct(Player) %>% pull(Player)

# Renaming mismatches
names <- data.frame(Player = mismatch_names[1:13],
                    player = c("Kristaps Porziņģis", "D'Angelo Russell", "Otto Porter", 
                               "Marcus Morris", "Patty Mills", "Nenê", "Tomáš Satoranský", "Jeff Green", 
                               "E'Twaun Moore",
                               "José Calderón", "Lou Williams", "Ish Smith",
                                "J.J. Hickson"))



player_stats_combined <- player_stats_combined %>% 
  left_join(names %>% mutate(Player = as.character(Player))) %>%
  ungroup() %>%
  mutate(player = ifelse(is.na(Player) == TRUE, player, Player)) %>% 
  select(-Player)

#######################
### CREATE FINAL DF ###
#######################

# Join the two dataframes
# table a - player_stats_combined
# table b - contracts

model_contract_data <- contracts %>% 
  mutate(Player = iconv(Player,from="UTF-8",to="ASCII//TRANSLIT"), 
         Player = str_replace(Player, "'", ""), 
         Player = str_replace(Player, '"', ""), 
         Player = trimws(Player)) %>%
  
  left_join(
    
    player_stats_combined %>% ungroup() %>%
      
      # 1) Adding one to season to join to contract year
      mutate(player = iconv(player,from="UTF-8",to="ASCII//TRANSLIT"), 
             player = str_replace(player, "'", ""), 
             player = str_replace(player, '"', ""), 
             player = trimws(player)), 
    
    by = c("Player"="player", "Contract_start" = "season")
    
    ) %>% 

  filter(is.na(ws) == FALSE)

# write to an rds
write_rds(model_contract_data, "../SunsProject/model_data.rds")



Variable Exploration

After collecting the necessary model data, I was ready to train a model in order to predict contract value of free agents. To do this I will utilize Boosting Models and xgBoost to predict yearly cap % of any free agent.

I first wanted to view the variables that I collected and their individual relation to the response variable - Cap Percentage. These relationships can be seen in the following visual. The variables shown in the correlation heatmap below are among the highest correlated with Cap Percentage. Other variables were removed due to high correlations with other independent variables.

NOTE: Other categorical variables included in the model such as Age and Free Agent Type (UFA/RFA) are not shown in the plot below.

#############################################
#### Creating a Correlation Plot for RMD ####
#############################################

# Read In Model Data
model_contract_data <- read_rds("../SunsProject/model_data.rds") %>% 
  
  # MUST PLAY MORE THAN 10 GAMES
  filter(g >= 10 & mp >= 10)

# Load in packages
# install.packages("ggcorrplot")
library(ggcorrplot)
library(ggplot2)
library(caret)

# Correlation Matrix 
corr <- model_contract_data %>% select(cap_perc, age = Age, mp:vorp) %>% 
  cor(use = "complete.obs")

# Getting most significant cors with cap_perc
sig_vars <- corr %>% as.data.frame() %>% rownames_to_column("Variable") %>% arrange(desc(cap_perc)) %>% 
  select(Variable, cap_perc) %>% 
  filter(cap_perc > 0.25) %>% 
  
  # Getting rid of non impact vars
  filter(!(Variable %in% c("obpm", "bpm", "x2ppercent", "x3pa", "fgpercent", "per", "x2pa")))

corr <- corr %>% as.data.frame() %>% 
  select(one_of(sig_vars$Variable)) %>% 
  rownames_to_column("Variable") %>% 
  filter(Variable %in% sig_vars$Variable) %>%
  arrange(desc(cap_perc))%>%
  column_to_rownames(var = "Variable") %>%
  as.matrix()

# Cor plot
# ggcorrplot(corr)

##############################################
#### Removing highly correlated variables ####
##############################################

# Run correlations
cor_mat <- model_contract_data %>% select(one_of(sig_vars$Variable)) %>% 
  select(-cap_perc) %>%
  cor(use = "complete.obs")

# Removing highly correlated ind. vars
index <- findCorrelation(cor_mat, 0.9)

#the name of the columns chosen above
to_be_removed <- colnames(cor_mat)[index]

#now go back to df and use to_be_removed to subset the original df
model_contract_data <- model_contract_data[!names(model_contract_data) %in% c(to_be_removed, "pf")]

##########################################
## REDOING WITHOUT CORRELATED IND. VARS ##
##########################################

# Correlation Matrix 
corr <- model_contract_data %>% select(cap_perc, one_of(sig_vars$Variable)) %>% 
  cor(use = "complete.obs")

# Renaming Vars for Rmd Pres
new_names <- c("Cap %", "VORP", "FGA", "OWS", "MPG", "FTA", "DWS", "TOV", "RPG", "STL", "USG %", "APG", "BLK", "TS %", "ORPG")

colnames(corr) <- new_names
rownames(corr) <- new_names

# Cor plot
ggcorrplot(corr, lab = TRUE, show.legend = FALSE)



Training the Model

I was now ready to use the variables seen in the previous section in a boosting modeling process coupled with k-fold cross validation. To do this, I used the package xgboost. The code to train this model is in the following chunk.

library(xgboost)
library(caTools)
library(pbapply)

#####################
##### OBJECTIVE #####
#####################

# 1) Get modeling data frame ready for boosting model creation
# 2) Train Boosting Model and tune parameters


#######################
#### 1) PREPROCESS ####
#######################

gbm_data <- model_contract_data %>% 
  
  # High Correl Variables
  select(Type, Position, Age, one_of(sig_vars$Variable)) %>% 
  
  #Scaling
  mutate_at(vars(vorp:orb), scale) %>% 
  
  #Guard/Forward/Center Translation
  mutate(Position = ifelse(Position %in% c("PG", "SG"), "Guard",
                                 ifelse(Position %in% c("SF", "PF"), "Forward", "Center"))) %>% 
  na.omit()


##########################
#### 1) TUNING PARAM. ####
##########################

# Training Set Split
set.seed(1998)
sample <- sample.split(gbm_data$cap_perc, SplitRatio = 0.8) 

train <- subset(gbm_data,sample == TRUE)
test<- subset(gbm_data,sample == FALSE)

# One hot encoding for categorical vars 
labels <- train$cap_perc
ts_label <- test$cap_perc
train2 <- model.matrix(~.+0, data = train[,-4])
test2 <- model.matrix(~.+0,data = test[,-4])

# Preparing xgb matrices
dtrain <- xgb.DMatrix(data = train2,label = labels) 
dtest <- xgb.DMatrix(data = test2,label=ts_label)

# Setting up grid of potential parameters
searchGridSubCol <- expand.grid(subsample = c(0.8,0.9,1.0), 
                                colsample_bytree = c(0.6, 0.8, 1),
                                max_depth = c(4,6,8,10),
                                min_child = seq(1), 
                                eta = c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3))

ntrees <- 100

# Grid Search for tuning parameters
system.time(
  rmseHyperparameters <- pbapply(searchGridSubCol, 1, function(parameterList){
    
    #Extract Parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentDepth <- parameterList[["max_depth"]]
    currentEta <- parameterList[["eta"]]
    currentMinChild <- parameterList[["min_child"]]
    xgboostModelCV <- xgb.cv(data =  dtrain, nrounds = ntrees, nfold = 10, showsd = TRUE, 
                             metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                             "objective" = "reg:linear", "max.depth" = currentDepth, "eta" = currentEta,                               
                             "subsample" = currentSubsampleRate, "colsample_bytree" = currentColsampleRate
                             , print_every_n = 10, "min_child_weight" = currentMinChild, booster = "gbtree",
                             early_stopping_rounds = 10)
    
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    test_rmse <- tail(xvalidationScores$test_rmse_mean, 1)
    train_rmse <- tail(xvalidationScores$train_rmse_mean,1)
    output <- return(c(test_rmse, train_rmse, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, currentMinChild))
    }))

# Results of grid search  - df creation
gbm_results <- as.data.frame(t(rmseHyperparameters))
colnames(gbm_results) <- c("TestRMSE", "TrainRMSE", "SubSampRate", "ColSampRate", "Depth", "eta", "currentMinChild")

# Best combo
bst_results <- gbm_results %>% arrange((TestRMSE)) %>% dplyr::slice(1)

# Using tuned parameters
bstmodelcv <- xgb.cv(data =  dtrain, nrounds = 100, nfold = 10, showsd = TRUE, 
                         metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                         "objective" = "reg:linear", "max.depth" = bst_results$Depth, "eta" = bst_results$eta,                               
                         "subsample" = bst_results$SubSampRate, "colsample_bytree" = bst_results$ColSampRate,
                          print_every_n = 10, "min_child_weight" = bst_results$currentMinChild, booster = "gbtree",
                         early_stopping_rounds = 10, 
                      watchlist = list(val=dtest,train=dtrain))

bstmodel <- xgboost(data =  dtrain, nrounds = bstmodelcv$best_iteration, showsd = TRUE,
                    verbose = TRUE,
                         "objective" = "reg:linear", "max.depth" = bst_results$Depth, "eta" = bst_results$eta,                               
                         "subsample" = bst_results$SubSampRate, "colsample_bytree" = bst_results$ColSampRate,
                          print_every_n = 10, "min_child_weight" = bst_results$currentMinChild, booster = "gbtree",
                      early_stopping_rounds = 10)


# Feature importance
importance_mat <- xgb.importance(feature_names = bstmodel$feature_names, model = bstmodel)

#Write to rds for output to rmd in another code chunk
write_rds(importance_mat, "../SunsProject/importance_matrix.rds")
write_rds(bstmodel, "../SunsProject/gbm_contracts.rds")

Upon fitting a model to predict contract value for Free Agents, we can see the variable importance plot below.



Predicting on 2020 FA Class

With a model now trained, I needed to then obtain players and their respective variables for the 2020 Free Agent Class. In this way, I can approximate the contract value of each player in the 2020 class. I collected the data using the following code.

Once I collected the necessary data for the 2020 Free Agent class, I used the model to predict the contract value for each of the Forward Free Agents in this year’s class.

#####################
##### OBJECTIVE #####
#####################

# Obtain player stats and free agent type of each player in the 2020 FA Class

######################
##### DATA SETUP #####
######################

# 1) 2019-2020 Player Stats
upcoming_fa_stats <- player_stats_combined %>% filter(season == 2020) %>% 
  filter(mp >= 10 & g >= 10)

# 2) Read in 2020 Free Agents
free_agents <- read_excel("../SunsProject/Contracts.xlsx", sheet = "Sheet2") %>% 
  
  select(Player, Position, Age = AGE, Exp = EXP, Type = TYPE) %>%
  
  mutate(actual_type = Type, 
         actual_type = ifelse(actual_type == "PO", "Player Option", 
                              ifelse(actual_type == "CO", "Team Option", 
                                     actual_type)),
         Type = ifelse(Type %in% c("UFA", "RFA"), Type, "UFA"))

# Joining to get a df to predict on 
free_agents_final_df <- free_agents %>% 
  mutate(Player = iconv(Player,from="UTF-8",to="ASCII//TRANSLIT"), 
         Player = str_replace(Player, "'", ""), 
         Player = str_replace(Player, '"', ""), 
         Player = trimws(Player)) %>%
  
  left_join(
    
   upcoming_fa_stats %>% ungroup() %>%
      
      # 1) Adding one to season to join to contract year
      mutate(player = iconv(player,from="UTF-8",to="ASCII//TRANSLIT"), 
             player = str_replace(player, "'", ""), 
             player = str_replace(player, '"', ""), 
             player = trimws(player)), 
    
    by = c("Player"="player")
    
    ) %>% 

  filter(is.na(ws) == FALSE)


################################
##### CONTRACT PREDICTIONS #####
################################

# Set up dataframe in same way as modeling
free_agents_final_df <- free_agents_final_df %>% select(Player, Type, Position, Age,  one_of(sig_vars$Variable)) %>% 
  select(-pf) %>% 
  select(!(one_of(to_be_removed))) %>%
  mutate_at(vars(vorp:orb), scale) %>%
  mutate(Position_actual = Position, 
          Position = ifelse(Position %in% c("PG", "SG"), "Guard",
                                 ifelse(Position %in% c("SF", "PF"), "Forward", "Center")))

# Getting actual positions of players
actual_positions <- free_agents_final_df %>% 
  select(Player, Position_actual)

# Setting up model df
free_agents_final_df <- free_agents_final_df %>% 
  select(-Position_actual)
free_agents_mat <- model.matrix(~.+0, data = free_agents_final_df[-1]) 

# PREDICT
options(scipen = 999)
projected_contracts <- round(predict (bstmodel,free_agents_mat), 4)


predictions <- cbind(free_agents_final_df, projected_contracts) %>%
  
  left_join(free_agents %>% select(Player, actual_type) %>% 
                mutate(Player = iconv(Player,from="UTF-8",to="ASCII//TRANSLIT"), 
         Player = str_replace(Player, "'", ""), 
         Player = str_replace(Player, '"', ""), 
         Player = trimws(Player))) %>% 
  
  select(Player, Type = actual_type, Position, Cap_Perc = projected_contracts) %>% arrange(desc(Cap_Perc))  %>% 
  left_join(actual_positions) %>% 
  
  # Renaming/organizing columns
  select(Player, Position = Position_actual, Type, `Cap %` = Cap_Perc)

# Output to Rmd
library(DT)
DT::datatable(predictions %>% mutate(`Cap %` = percent(`Cap %`)), filter = 'top',
              
              options = list()) 


# Write to rds
write_rds(predictions, "../SunsProject/FA_predictions.rds")

2020 Forward Free Agents’ predicted annual contract values




Player Evaluation Dashboard

I deployed a Shiny Dashboard app that shows the Forward class at a glance. This dashboard can be found at


https://joesiwinski.shinyapps.io/SiwinskiProject/.



Final Player Recommendation

Upon viewing the shiny visualization dashboard and analyzing which players projected yearly salary fit the Exception, I recommend the team to sign Paul Millsap. Millsap not only is a plus defender and a plus catch and shoot player, he also brings a veteran presence to the locker room.